Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

… uncertainty in the raw multi-dimensional data space is measured by the covariance matrices of the two classes, denoted by $\Sigma_1$ and $\Sigma_2$. The relationship between the variance in the mapping space and the covariance matrices in the raw space is defined as below,

$$S_W = \mathbf{w}^{t}(\Sigma_2 + \Sigma_1)\mathbf{w} \tag{3.7}$$

Suppose the distance between the two mapping centres is denoted by $S_B = \mathbf{w}^{t}(\mathbf{u}_2 - \mathbf{u}_1)$. A discrimination ratio, which measures the mapping quality associated with the discrimination power of a classifier, is defined as below,

$$R(\mathbf{w}) = \frac{S_B}{S_W} = \frac{\mathbf{w}^{t}(\mathbf{u}_2 - \mathbf{u}_1)}{\mathbf{w}^{t}(\Sigma_2 + \Sigma_1)\mathbf{w}} \tag{3.8}$$
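To make the quantities in Equations (3.7) and (3.8) concrete, the following is a minimal Python sketch that evaluates the projected variance, the projected centre distance and their ratio for a candidate mapping vector. The ratio is computed exactly as written in Equation (3.8). The function names and all numeric values are hypothetical and chosen for illustration only; they are not taken from the book.

```python
def dot(a, b):
    """Inner product of two vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def mat_vec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in m]


def within_variance(w, cov1, cov2):
    """S_W = w^t (Sigma_2 + Sigma_1) w, as in Equation (3.7)."""
    summed = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(cov1, cov2)]
    return dot(w, mat_vec(summed, w))


def centre_distance(w, u1, u2):
    """S_B = w^t (u_2 - u_1): distance between the two mapping centres."""
    return dot(w, [b - a for a, b in zip(u1, u2)])


def discrimination_ratio(w, u1, u2, cov1, cov2):
    """R(w) = S_B / S_W, following Equation (3.8) as written."""
    return centre_distance(w, u1, u2) / within_variance(w, cov1, cov2)


# Hypothetical class statistics (illustrative values only).
u1 = [0.0, 0.0]
u2 = [2.0, 1.0]
cov1 = [[1.0, 0.0], [0.0, 1.0]]
cov2 = [[1.0, 0.0], [0.0, 1.0]]

ratio_x = discrimination_ratio([1.0, 0.0], u1, u2, cov1, cov2)
ratio_y = discrimination_ratio([0.0, 1.0], u1, u2, cov1, cov2)
print(ratio_x, ratio_y)  # 1.0 0.5
```

With these assumed values, projecting onto the first axis gives a larger ratio than projecting onto the second, because the class centres differ more along the first axis while the spread is the same in both directions.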

When $R(\mathbf{w})$ is optimised (maximised), the mapping vector $\mathbf{w}$ is said to be the optimal projection direction and is able to maximise the separation between the two classes in the raw space. In the projection space, the distance between the two mapping centres in the $\hat{y}$ space will be maximised and the variance of the two clusters in the $\hat{y}$ space will be minimised. LDA assumes that the two classes have an identical covariance matrix, hence an identical volume, i.e., $\Sigma_1 = \Sigma_2 = \Sigma$. Maximising $R(\mathbf{w})$ leads to the solution of an LDA model as below [Duda, et al., 2000],

$$\hat{\mathbf{w}} \propto \Sigma^{-1}(\mathbf{u}_2 - \mathbf{u}_1) \tag{3.9}$$
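As an illustration of Equation (3.9), the sketch below computes a projection direction for the two-dimensional case by inverting a 2 x 2 shared covariance matrix in closed form. The helper names (`inverse_2x2`, `lda_direction`) and the numeric values are assumptions made for this example, not the book's own code or data. Because Equation (3.9) fixes the direction only up to a constant of proportionality, the result is left unscaled.

```python
def inverse_2x2(m):
    """Closed-form inverse of a 2 x 2 matrix stored as a list of rows."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def lda_direction(u1, u2, cov):
    """Return Sigma^{-1} (u_2 - u_1), proportional to the w-hat of Equation (3.9)."""
    diff = [b - a for a, b in zip(u1, u2)]
    inv = inverse_2x2(cov)
    return [sum(r * d for r, d in zip(row, diff)) for row in inv]


# Hypothetical mean vectors and shared covariance matrix (illustrative only).
u1 = [1.0, 1.0]
u2 = [3.0, 2.0]
cov = [[2.0, 1.0], [1.0, 2.0]]

w_hat = lda_direction(u1, u2, cov)
print(w_hat)  # approximately [1.0, 0.0]
```

In this assumed example the raw difference between the means is (2, 1), but the correlation between the two variables rotates the resulting direction onto the first axis.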

In other words, the calculation of the mean vectors ($\mathbf{u}_1$ and $\mathbf{u}_2$) and the covariance matrix ($\Sigma$) leads to the solution of the estimated or optimised projection direction ($\hat{\mathbf{w}}$). A simple two-dimensional data set shown in … is used to demonstrate how an LDA model can be constructed using the principle discussed above. Based on this data set, the mean vector for each class can be calculated using the equation shown below,

$$\mathbf{u}_k = \frac{1}{N_k}\sum_{n=1}^{N_k}\mathbf{x}_n^{k} \tag{3.10}$$
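Equation (3.10) is a per-dimension average over the samples of one class. A minimal Python sketch is given below; the function name `class_mean` and the two small sample sets are hypothetical and are not the data set referred to in the text.

```python
def class_mean(samples):
    """Mean vector u_k = (1 / N_k) * sum of the N_k samples of class k."""
    n_k = len(samples)
    if n_k == 0:
        raise ValueError("class has no samples")
    dim = len(samples[0])
    return [sum(x[d] for x in samples) / n_k for d in range(dim)]


# Hypothetical two-dimensional samples for two classes (illustrative only).
class_1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
class_2 = [[6.0, 5.0], [7.0, 7.0], [8.0, 9.0]]

u1 = class_mean(class_1)
u2 = class_mean(class_2)
print(u1)  # [2.0, 3.0]
print(u2)  # [7.0, 7.0]
```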